Profiling relational data: a survey
Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher, and it is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
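The simpler single-column statistics the survey mentions (null counts, distinct counts, frequent values) can be sketched in a few lines. The function name and the sample data below are invented for illustration and are not from the survey:

```python
from collections import Counter

def profile_column(values):
    """Compute basic single-column profiling statistics:
    null count, distinct count, and the most frequent value."""
    nulls = sum(1 for v in values if v is None)
    non_null = [v for v in values if v is not None]
    most_common = Counter(non_null).most_common(1)
    return {
        "nulls": nulls,
        "distinct": len(set(non_null)),
        "top_value": most_common[0][0] if most_common else None,
    }

stats = profile_column(["a", "b", "a", None, "a"])
print(stats)  # {'nulls': 1, 'distinct': 2, 'top_value': 'a'}
```

The multi-column metadata the survey covers (unique column combinations, functional and inclusion dependencies) require searching over sets of columns and are correspondingly harder to compute.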
AutoML in Heavily Constrained Applications
Optimizing a machine learning pipeline for a task at hand requires careful
configuration of various hyperparameters, typically supported by an AutoML
system that optimizes the hyperparameters for the given training dataset. Yet,
depending on the AutoML system's own second-order meta-configuration, the
performance of the AutoML process can vary significantly. Current AutoML
systems cannot automatically adapt their own configuration to a specific use
case. Further, they cannot comply with user-defined application constraints on the
effectiveness and efficiency of the pipeline and its generation. In this paper,
we propose Caml, which uses meta-learning to automatically adapt its own AutoML
parameters, such as the search strategy, the validation strategy, and the
search space, for a task at hand. The dynamic AutoML strategy of Caml takes
user-defined constraints into account and obtains constraint-satisfying
pipelines with high predictive performance.
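The constraint-satisfying selection described in the abstract can be illustrated by a minimal sketch: among candidate pipelines, discard those violating a user-defined efficiency constraint, then pick the most accurate survivor. All names, metrics, and the threshold below are invented; this is not the actual Caml search strategy:

```python
def select_pipeline(candidates, max_inference_ms):
    """Pick the most accurate candidate pipeline that satisfies a
    user-defined efficiency constraint (maximum inference time)."""
    feasible = [c for c in candidates if c["inference_ms"] <= max_inference_ms]
    if not feasible:
        return None  # no constraint-satisfying pipeline exists
    return max(feasible, key=lambda c: c["accuracy"])

candidates = [
    {"name": "gbm_large", "accuracy": 0.91, "inference_ms": 40},
    {"name": "logreg",    "accuracy": 0.84, "inference_ms": 2},
    {"name": "gbm_small", "accuracy": 0.88, "inference_ms": 9},
]
best = select_pipeline(candidates, max_inference_ms=10)
print(best["name"])  # gbm_small
```

Caml goes further by using meta-learning to choose the search strategy, validation strategy, and search space themselves, rather than only filtering finished candidates.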
The Need for Incorporation of the Principles of Fiscal Sociology in Social Policy in Ukraine
The article proposes a new principle for financing social expenditures in a country with an insufficient level of democracy under conditions of economic crisis, which the authors suggest calling the Pareto anti-optimum.
Unsupervised String Transformation Learning for Entity Consolidation
Data integration has been a long-standing challenge in data management with
many applications. A key step in data integration is entity consolidation. It
takes a collection of clusters of duplicate records as input and produces a
single "golden record" for each cluster, which contains the canonical value for
each attribute. Truth discovery and data fusion methods, as well as Master Data
Management (MDM) systems, can be used for entity consolidation. However, to
achieve better results, the variant values (i.e., values that are logically the
same with different formats) in the clusters need to be consolidated before
applying these methods.
For this purpose, we propose a data-driven method to standardize the variant
values based on two observations: (1) the variant values usually can be
transformed to the same representation (e.g., "Mary Lee" and "Lee, Mary") and
(2) the same transformation often appears repeatedly across different clusters
(e.g., transpose the first and last name). Our approach first uses an
unsupervised method to generate groups of value pairs that can be transformed
in the same way (i.e., they share a transformation). Then the groups are
presented to a human for verification and the approved ones are used to
standardize the data. In a real-world dataset with 17,497 records, our method
achieved 75% recall and 99.5% precision in standardizing variant values by
asking a human 100 yes/no questions, clearly outperforming a state-of-the-art
data wrangling tool.
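The grouping idea in the abstract, that the same transformation (e.g., transposing first and last name) recurs across clusters, can be sketched by keying each value pair on the token permutation that maps one value to the other. The function and sample pairs below are invented for illustration, not the paper's actual algorithm:

```python
def shared_token_permutation(a, b):
    """Return a token-level permutation key if b's tokens are a
    reordering of a's tokens (ignoring commas), else None."""
    ta = a.replace(",", " ").split()
    tb = b.replace(",", " ").split()
    if sorted(ta) != sorted(tb):
        return None
    # For each token of b, record its position in a.
    return tuple(ta.index(t) for t in tb)

pairs = [("Mary Lee", "Lee, Mary"),
         ("John Smith", "Smith, John"),
         ("Acme Inc", "Acme Incorporated")]

# Group value pairs that share the same transformation.
groups = {}
for a, b in pairs:
    key = shared_token_permutation(a, b)
    if key is not None:
        groups.setdefault(key, []).append((a, b))
print(groups)  # {(1, 0): [('Mary Lee', 'Lee, Mary'), ('John Smith', 'Smith, John')]}
```

In this sketch the two name pairs fall into one group (the "swap the two tokens" transformation), which could then be shown to a human for a single yes/no verification, while the abbreviation pair is left ungrouped.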
Duplicate Table Detection with Xash
Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive, as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, to the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which acts like a Bloom filter and instantly indicates the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.
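The super-key idea can be illustrated with a deliberately simplified stand-in (this is not the actual Xash construction): OR one hash bit per cell value into a bitmask, so that a missing bit proves a value is absent, while a present bit only suggests it may be contained, exactly the Bloom-filter-style pruning the abstract describes:

```python
def super_key(values, bits=64):
    """Build a Bloom-filter-like bitmask by OR-ing one hash bit per value.
    Simplified stand-in for Xash; the real function is more elaborate."""
    mask = 0
    for v in values:
        mask |= 1 << (hash(v) % bits)
    return mask

def may_contain(mask, value, bits=64):
    """A set bit is necessary (but not sufficient) for membership:
    False means definitely absent; True means possibly present."""
    return mask & (1 << (hash(value) % bits)) != 0

row = ("Alice", "Berlin", "1990")
mask = super_key(row)
# No false negatives: every value actually in the row tests positive.
assert all(may_contain(mask, v) for v in row)
```

Candidate rows or tables whose masks lack required bits can be discarded without any cell-by-cell comparison, which is where the speed-up comes from.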
Advancing the discovery of unique column combinations
Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the well-known Gordian algorithm [9] and “Apriori-based” algorithms [4] are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution, HCA-Gordian, combines the advantages of Gordian and our new algorithm HCA, and it outperforms all previous work in many situations.
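The Apriori-style lattice search underlying such algorithms can be sketched as follows: grow column sets level by level and prune any superset of an already-unique set, since it cannot be minimal. This is a minimal illustration with invented data, not the optimized HCA candidate generation:

```python
from itertools import combinations

def minimal_uccs(rows, n_cols):
    """Bottom-up, Apriori-style search for minimal unique column
    combinations: a column set is unique if its projection has no
    duplicate tuples; supersets of unique sets are pruned."""
    def is_unique(cols):
        projected = [tuple(r[c] for c in cols) for r in rows]
        return len(set(projected)) == len(rows)

    found = []
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            if any(set(m) <= set(cols) for m in found):
                continue  # a subset is already unique -> not minimal
            if is_unique(cols):
                found.append(cols)
    return found

rows = [("a", 1, "x"), ("a", 2, "x"), ("b", 1, "y")]
print(minimal_uccs(rows, 3))  # [(0, 1), (1, 2)]
```

No single column is unique here, but columns {0,1} and {1,2} each identify every row; the full set {0,1,2} is pruned because its subsets are already unique. Real discovery algorithms add far more aggressive pruning and memory-efficient data structures to scale beyond toy inputs.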
SPRINT: Ranking Search Results by Paths
Graph-structured data abounds and has become the subject of much attention in the past years, for instance when searching and analyzing social network structures. Measures such as the shortest path or the number of paths between two nodes are used as proxies for similarity or relevance [1]. These approaches benefit from the fact that the measures are determined from some context node, e.g., “me” in a social network. With SPRINT, we apply these notions to a new domain, namely ranking web search results using the link-path structure among pages. SPRINT demonstrates the feasibility and effectiveness of Searching by Path Ranks on the INTernet with two use cases: First, we re-rank intranet search results based on the position of the user’s homepage on the graph. Second, as a live proof of concept, we dynamically re-rank Wikipedia search results based on the currently viewed page: when viewing the Java software page, a search for “Sun” ranks Sun Microsystems higher than the star at the center of our solar system. We evaluate the first use case with a user study. The second use case is the focus of the demonstration and allows users to actively test our system with any combination of context page and search term.
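The context-node re-ranking described above can be sketched with a plain breadth-first search: compute hop distances from the currently viewed page and sort search results by distance. The tiny link graph below is invented to mirror the Java/Sun example; it is not SPRINT's actual ranking function:

```python
from collections import deque

def bfs_distances(graph, start):
    """Shortest-path distance (in hops) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in graph.get(node, []):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

# Invented link graph: pages and their outgoing links.
graph = {
    "Java": ["Sun Microsystems", "Programming language"],
    "Sun Microsystems": ["Java"],
    "Solar System": ["Sun (star)"],
    "Sun (star)": ["Solar System"],
}
results = ["Sun (star)", "Sun Microsystems"]
d = bfs_distances(graph, "Java")
# Unreachable pages sort last; closer pages rank higher.
ranked = sorted(results, key=lambda p: d.get(p, float("inf")))
print(ranked)  # ['Sun Microsystems', 'Sun (star)']
```

With "Java" as the context node, Sun Microsystems is one hop away while the star is unreachable in this toy graph, reproducing the re-ranking behavior the abstract describes.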